DATA 602
PK O'Flaherty
December 5, 2022
The purpose of the data project is for you to conduct an analysis with a dataset of your choosing.

Picture of an abalone from the National Oceanic and Atmospheric Administration's website
Photo: NOAA Fisheries/Michael Ready
In this Google Colab notebook we take a popularly downloaded dataset from the UCI Machine Learning Repository and, in Python, perform common exploratory data analysis and preprocessing steps before running a K-Nearest Neighbors classifier to identify whether an individual abalone is male, female or infant, based on measurements including dimensions, weights, and shell ring counts indicating age.
This analysis would be of interest to people starting a career in data science or data analysis, and those interested in marine biology and environmental sustainability.
Typically this dataset has been used to predict the age of an abalone without having to slice and stain the shell for counting rings under a microscope; here, however, we turn it into a machine learning classification problem: identifying sex or sexual immaturity from the already collected measurements.
Ultimately the problem was not tractable under KNN classification, regardless of the number-of-neighbors hyperparameter. This suggests that abalone have low sexual dimorphism, with males and females roughly the same size at a given age, and that sexual maturity is not entirely dependent on age; it might instead be triggered by an abundance of food, since infants tended to gain less weight per year than sexually mature individuals.
It is easy to identify an abalone's sex by looking between the foot and shell: the skin and guts are a blue-gray-green color, and in sexually mature males a large amount of sperm is visible as white through the skin, next to the guts.
However, can a machine reliably identify an abalone as male, female or sexually immature based on measurements alone?
We intend to address this question by processing and exploring the data, training a machine learning classifier across multiple hyperparameter settings, and scoring the resulting models to identify a suitable one.
The data comes from the UCI Machine Learning Repository Abalone Data Set and was originally sourced from the following study:
Warwick J Nash, Tracy L Sellers, Simon R Talbot, Andrew J Cawthorn and Wes B Ford (1994) "The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait", Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288)
The data contains 4,177 records, each with a sex label and eight measurements, and no missing values. The seven continuous attributes were scaled for use with an artificial neural network by dividing by 200.
Attributes summary:
| Data | Type | Info | Example (7th record, raw scaled values) |
|---|---|---|---|
| Sex | Group | Male, Female or Infant | F |
| Length | Float | mm | 0.530 |
| Diameter | Float | mm | 0.415 |
| Height | Float | mm | 0.150 |
| Whole weight | Float | g | 0.7775 |
| Shucked weight | Float | g | 0.2370 |
| Viscera weight | Float | g | 0.1415 |
| Shell weight | Float | g | 0.330 |
| Rings | Integer | count | 20 |
Code Summary:
### Load libraries
# core
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# ml
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.model_selection import train_test_split as tts
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
# graphing
import plotly.express as px
### Load data
df = pd.read_csv('https://raw.githubusercontent.com/pkofy/DATA602/main/FinalProject/abalone.data', header=None)
# We read the data from github for reproducibility after having uploaded the data to github and credited it there
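A quick way to sanity-check the parse is to confirm the column count and the absence of missing values. The two inline rows below are illustrative stand-ins for the downloaded file (the real dataframe should also have 4,177 rows):

```python
import io
import pandas as pd

# Two illustrative rows in the same headerless format as abalone.data
raw = io.StringIO(
    "M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15\n"
    "F,0.530,0.415,0.150,0.7775,0.237,0.1415,0.33,20\n"
)
sample_df = pd.read_csv(raw, header=None)

# The real file should parse to 4,177 rows, 9 columns, and no missing values
assert sample_df.shape[1] == 9
assert not sample_df.isna().any().any()
print(sample_df.shape)  # (2, 9)
```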
We performed an exploratory data analysis in the project proposal, so we are doing data wrangling first; this will set us up for a more in-depth exploratory data analysis in the next section.
The continuous data is listed in 1/200ths of a millimeter or gram, so we undo that scaling.
The age of an abalone is 1.5 plus the number of rings it has.
In the proposal's exploratory data analysis, the weight ranges for individual abalones were so large, and overlapped so much between the groups, that we need new growth columns representing how much length or weight an abalone gains per year on average.
Code Summary:
# Label columns
df.columns = ['sex', 'length', 'diameter', 'height', 'whole_weight', 'shucked_weight',
'viscera_weight', 'shell_weight', 'rings']
# Undo scaling on the continuous columns by multiplying by 200
df.iloc[:,1:8] = df.iloc[:,1:8]*200
# Create 'age' column from 'rings' column by adding 1.5
df['age'] = df.rings + 1.5
# Create growth columns, 'length_per_year' and 'weight_per_year', based on 'length', 'whole weight' and 'age'
df['length_per_year'] = df.length / df.age
df['weight_per_year'] = df.whole_weight / df.age
# Relabel the 'sex' column to be the whole word
df.sex = df.sex.replace(['M', 'F', 'I'], ['Male', 'Female', 'Infant'])
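Applying the same wrangling steps to the 7th record from the attributes table gives a concrete check of the transformations (values below computed by hand):

```python
import pandas as pd

# The 7th record from the attributes table, in the raw (scaled) units
row = pd.DataFrame([['F', 0.530, 0.415, 0.150, 0.7775, 0.2370, 0.1415, 0.330, 20]],
                   columns=['sex', 'length', 'diameter', 'height', 'whole_weight',
                            'shucked_weight', 'viscera_weight', 'shell_weight', 'rings'])

row.iloc[:, 1:8] = row.iloc[:, 1:8] * 200  # undo the divide-by-200 scaling
row['age'] = row.rings + 1.5               # rings + 1.5 approximates age in years
row['length_per_year'] = row.length / row.age
row['weight_per_year'] = row.whole_weight / row.age

print(row[['length', 'whole_weight', 'age', 'length_per_year']].round(3))
# length 106.0 mm, whole_weight 155.5 g, age 21.5, length_per_year ~4.930
```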
When we completed our initial exploratory data analysis for the proposal it was clear that there was a lot of overlap in the length and whole weights for the abalone. Here we look into that more closely as well as compare the average growth in length and weight per year to try and factor out age.
Code Summary:
### Figure 1 - Length versus Whole Weight by Sex (Bar Chart)
fig1 = px.bar(df, x='length', y='whole_weight', color='sex', barmode='group',
labels={"length": "Length (mm)",
"whole_weight": "Whole Weight (g)",
"sex": "Sex"},
title="Figure 1 - Length versus Whole Weight by Sex (Bar Chart)")
fig1.show(renderer='notebook')
From Figure 1 we can see that Males and Females overlap fairly closely in both length and whole weight. This means we are unlikely to be able to classify between male and female based on these parameters alone.
The Infants have a different distribution of length and whole weight; however, there is enough overlap that it may be difficult to classify between sexually mature and immature individuals as well.
### Figure 2 - Length versus Weight by Sex (Scatter Plot)
fig2 = px.scatter(df, x='length', y='whole_weight', color='sex', facet_col='sex',
labels={"length": "Length (mm)",
"whole_weight": "Weight (g)",
"sex": "Sex"},
title="Figure 2 - Length versus Whole Weight by Sex (Scatter Plot)")
fig2.show(renderer='notebook')
In Figure 2 we see again that Males and Females overlap in weight and length distribution. However with the scatterplot faceted by sex it becomes clear that infants tend to not put on as much weight per length as sexually mature individuals.
### Figure 3 - Length per year by Sex
fig3 = px.box(df, x="sex", y="length_per_year", color="sex",
labels={"sex": "Sex",
"length_per_year": "Length/year (mm)"},
title="Figure 3 - Length per year by Sex")
fig3.show(renderer='notebook')
In Figure 3 we can see that all three classes of abalone tend to grow roughly the same length per year.
### Figure 4 - Weight per year by Sex
fig4 = px.box(df, x="sex", y="weight_per_year", color="sex",
labels={"sex": "Sex",
"weight_per_year": "Weight/year (g)"},
title="Figure 4 - Weight per year by Sex")
fig4.show(renderer='notebook')
In Figure 4, we can confirm that the infants tend to have a lower growth in weight per year than sexually mature individuals.
In the machine learning analysis section we attempt to classify the sex and sexual maturity from the measurements.
Initially we try to classify sex based on the remaining features (with a 75/25 train/test split and 5 neighbors) but achieve an accuracy of only 55.69%.
Then we repeat the classification using only the Male and Female records (excluding Infants) and achieve a similar accuracy of 54.72%, which is equivalent to guessing.
We try the classification again, this time combining Male and Female into one class, 'Mature', and achieve an improved accuracy of 82.20%.
Code Summary:
### 1. KNN classification for Male, Female or Infant
# Split the dataframe into numpy arrays grouped into features (X) and target (Y) values
X = df.iloc[:,1:9].to_numpy()
y = df.iloc[:,0:1].to_numpy()
# Split the abalone data into 75% train and 25% test sets
X_train, X_test, y_train, y_test = tts(X, y, test_size=0.25, random_state=100)
# We can fit and transform the X training set in one step
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Fit the 5-neighbor kNN model to the training data
kn5 = KNN(n_neighbors=5)
kn5.fit(X_train, y_train.ravel())
# Predict the targets for the test data
y_pred = kn5.predict(X_test)
# Return our accuracy on the test data
print("Test accuracy (M/F/I) with n_neighbors=5: {:.2f}%".format(accuracy_score(y_test, y_pred)*100))
While this result is better than guessing, since there are three classes, let's split the problem into Male versus Female, and sexually Mature versus Immature, to see where the prediction success is coming from.
Note that for the data preprocessing we only include the original columns, since the three created columns ('age', 'length_per_year' and 'weight_per_year') are derived directly from the originals.
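For the 'age' column in particular, excluding it costs nothing for a distance-based model: 'age' is just 'rings' shifted by a constant, so after standardization the two columns are identical. A quick numpy check of that claim (the ring counts below are arbitrary examples):

```python
import numpy as np

rings = np.array([5.0, 8.0, 10.0, 15.0, 20.0])
age = rings + 1.5  # the derived column is an affine function of rings

# Standardize both (z-scores): the +1.5 shift cancels out entirely
z_rings = (rings - rings.mean()) / rings.std()
z_age = (age - age.mean()) / age.std()

print(np.allclose(z_rings, z_age))  # True
```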
### 2. KNN classification for Male or Female
df_MF = df.loc[df['sex'] != 'Infant']
# Split the dataframe into numpy arrays grouped into features (X) and target (Y) values
X_MF = df_MF.iloc[:,1:9].to_numpy()
y_MF = df_MF.iloc[:,0:1].to_numpy()
# Split the abalone data into 75% train and 25% test sets
X_train_MF, X_test_MF, y_train_MF, y_test_MF = tts(X_MF, y_MF, test_size=0.25, random_state=100)
# We can fit and transform the X training set in one step
scaler = StandardScaler()
X_train_MF = scaler.fit_transform(X_train_MF)
X_test_MF = scaler.transform(X_test_MF)
# Fit the 5-neighbor kNN model to the training data
kn5 = KNN(n_neighbors=5)
kn5.fit(X_train_MF, y_train_MF.ravel())
# Predict the targets for the test data
y_pred_MF = kn5.predict(X_test_MF)
# Return our accuracy on the test data
print("Test accuracy between Male and Female with n_neighbors=5: {:.2f}%".format(accuracy_score(y_test_MF, y_pred_MF)*100))
Since we are looking at only two classes when comparing between Male and Female, a result close to 50% is equivalent to guessing. It appears a machine can't distinguish between Males and Females based on these measurements.
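One way to make "equivalent to guessing" precise is a majority-class baseline: always predicting the more common class. A minimal sketch (the label counts here are illustrative, not taken from our split):

```python
import numpy as np

# Hypothetical two-class test labels with a mild class imbalance
labels = np.array(['Male'] * 54 + ['Female'] * 46)

# Majority-class baseline: always predict the most frequent label
classes, counts = np.unique(labels, return_counts=True)
majority = classes[np.argmax(counts)]
baseline_acc = (labels == majority).mean()

print(majority, baseline_acc)  # Male 0.54
```

A classifier is only doing real work to the extent it beats this number.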
### 3. KNN classification for Sexually Mature or Infant
# Copy the dataframe so relabeling doesn't modify the original
df_MI = df.copy()
df_MI.sex = df_MI.sex.replace(['Male', 'Female'], ['Mature', 'Mature'])
# Split the dataframe into numpy arrays grouped into features (X) and target (Y) values
X_MI = df_MI.iloc[:,1:9].to_numpy()
y_MI = df_MI.iloc[:,0:1].to_numpy()
# Split the abalone data into 75% train and 25% test sets
X_train_MI, X_test_MI, y_train_MI, y_test_MI = tts(X_MI, y_MI, test_size=0.25, random_state=100)
# We can fit and transform the X training set in one step
scaler = StandardScaler()
X_train_MI = scaler.fit_transform(X_train_MI)
X_test_MI = scaler.transform(X_test_MI)
# Fit the 5-neighbor kNN model to the training data
kn5 = KNN(n_neighbors=5)
kn5.fit(X_train_MI, y_train_MI.ravel())
# Predict the targets for the test data
y_pred_MI = kn5.predict(X_test_MI)
# Return our accuracy on the test data
print("Test accuracy between Mature and Immature with n_neighbors=5: {:.2f}%".format(accuracy_score(y_test_MI, y_pred_MI)*100))
As expected, we were able to classify between sexually mature and immature abalone at a higher rate, which means the mild success we had in the first attempt was largely due to successfully classifying infant abalone.
Returning to the original classification problem, let's try a range of n_neighbors values to see if we can improve the result.
### 4. KNN Hyperparameter testing for number of neighbors
# Split data into training and testing data sets
X_train_HP, X_test_HP, y_train_HP, y_test_HP = tts(X, y, test_size=0.25, random_state=100)
# Create neighbors from 1 to 30
neighbors = np.arange(1,31)
train_accuracies = {}
test_accuracies = {}
# Run the model and capture training and test accuracies
for neighbor in neighbors:
knn = KNN(n_neighbors=neighbor)
knn.fit(X_train_HP, y_train_HP.ravel())
train_accuracies[neighbor] = knn.score(X_train_HP, y_train_HP.ravel())
test_accuracies[neighbor] = knn.score(X_test_HP, y_test_HP.ravel())
# Turn testing accuracies into a dataframe
atest = np.array(list(test_accuracies.items()))
df_test = pd.DataFrame(atest, columns = ['k', 'Accuracy'])
df_test['Type'] = 'Testing Accuracy'
# Turn training accuracies into a dataframe
atrain = np.array(list(train_accuracies.items()))
df_train = pd.DataFrame(atrain, columns = ['k', 'Accuracy'])
df_train['Type'] = 'Training Accuracy'
# Combine the testing and training accuracy dataframes
df_graph = pd.concat([df_test, df_train])
# Graph the overfitting/underfitting curve
fig5 = px.line(df_graph, x='k', y='Accuracy', color='Type',
title='kNN Accuracy by k in range 1-30')
fig5.show(renderer='notebook')
The bands of n_neighbors that seem to show a meaningful difference are:
- 1-4: lower success in testing
- 5-23: slightly higher success in testing
- 24+: slightly higher success, but no meaningful change in the problem
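Rather than eyeballing the curve, the best k can be read off the sweep programmatically (the accuracy values below are illustrative stand-ins for the `test_accuracies` dict built in the loop above):

```python
# Illustrative held-out accuracies keyed by k (stand-ins for the sweep's results)
acc_by_k = {1: 0.52, 5: 0.56, 15: 0.57, 24: 0.60, 30: 0.59}

# Choose the k with the highest test accuracy
best_k = max(acc_by_k, key=acc_by_k.get)
print(best_k, acc_by_k[best_k])  # 24 0.6
```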
Ultimately our machine learning model was not able to reliably identify an abalone as male, female or sexually immature based on the given measurements.
Distinguishing between male and female was roughly equivalent to guessing. Very high values of n_neighbors (24 or more) were associated with a higher score (60% accuracy versus 55%); however, that is also where the training and testing accuracies converge, and the apparent improvements at high neighbor counts may not reflect real structure in the data.
It could be that there is low sexual dimorphism among abalone, meaning males and females are roughly the same size. We also don't have information on whether abalone travel large distances in their lives and could be subject to different local water temperatures, food types or availability, or population pressures. For example, goldfish secrete a chemical that slows growth when its concentration in the surrounding water is high enough; abalone could similarly influence each other to grow more slowly when many abalone crowd one area.
Distinguishing between sexually mature and immature individuals was roughly 80% accurate. I would have expected a higher accuracy, but it may be that sexual maturity isn't triggered at a specific age or size. For example, a clownfish in a group will turn female if the group's female dies; environmental or community factors could likewise trigger sexual maturation in abalone. One clear difference is that infants tend to gain less whole weight per year than sexually mature abalone, so it may be that abundance of food triggers sexual maturation.
There was a hint of our results in a line from the UCI Machine Learning Repository: "Further information, such as weather patterns and location (hence food availability) may be required to solve the problem."
Options to extend this project for further work could be to:
- Try LogisticRegression from sklearn.linear_model, or another classifier, on the same classification problem
- Use LinearRegression from sklearn.linear_model, or a neural network, to add a second machine learning section predicting a continuous target: possibly total weight based on age and sex/sexual maturity, or age based on the collected features, as the data was originally intended
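As a sketch of the first extension, a LogisticRegression version of the mature-versus-infant classification might look like the following. The data here is synthetic, generated to loosely mimic the pattern in Figure 2 (infants smaller and lighter on average); it is a stand-in for the real abalone features, not a result:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(100)

# Synthetic (length, whole_weight) pairs: infants smaller and lighter on average
n = 400
mature = rng.normal(loc=[100.0, 150.0], scale=[15.0, 30.0], size=(n, 2))
infant = rng.normal(loc=[70.0, 60.0], scale=[15.0, 20.0], size=(n, 2))
X_lr = np.vstack([mature, infant])
y_lr = np.array(['Mature'] * n + ['Infant'] * n)

# Same 75/25 split and scaling approach as the KNN sections above
X_tr, X_te, y_tr, y_te = train_test_split(X_lr, y_lr, test_size=0.25, random_state=100)
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

clf = LogisticRegression()
clf.fit(X_tr, y_tr)
print("Test accuracy: {:.2f}%".format(clf.score(X_te, y_te) * 100))
```

On the real data, the interesting comparison would be this model's accuracy against the 82.20% the KNN achieved on the same split.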